Abnormality detection and surveillance system

ABSTRACT

A surveillance system having at least one primary video camera for translating real images of a zone into electronic video signals at a first level of resolution. The system includes means for sampling movements of an individual or individuals located within the zone from the video signal output from at least one video camera. Video signals of sampled movements of the individual are electronically compared with known characteristics of movements which are indicative of individuals having a criminal intent. The level of criminal intent of the individual or individuals is then determined and an appropriate alarm signal is produced.

FIELD OF THE INVENTION

This invention generally relates to surveillance systems, and more particularly, to trainable surveillance systems which detect and respond to specific abnormal video and audio input signals.

BACKGROUND OF THE INVENTION

Today's surveillance systems vary in complexity, efficiency and accuracy. Earlier surveillance systems use several closed circuit cameras, each connected to a devoted monitor. This type of system works sufficiently well for low-coverage sites, i.e., areas requiring up to perhaps six cameras. In such a system, a single person could scan the six monitors, in “real” time, and effectively monitor the entire (albeit small) protected area, offering a relatively high level of readiness to respond to an abnormal act or situation observed within the protected area. In this simplest of surveillance systems, it is left to the discretion of security personnel to determine, first, if there is any abnormal event in progress within the protected area, second, the level of concern placed on that particular event, and third, what actions should be taken in response to the particular event. The reliability of the entire system depends on the alertness and efficiency of the worker observing the monitors.

Many surveillance systems, however, require the use of a greater number of cameras (e.g., more than six) to police a larger area, such as at least every room located within a large museum. To adequately ensure reliable and complete surveillance within the protected area, either more personnel must be employed to constantly watch the additionally required monitors (one per camera), or fewer monitors may be used on a simple rotation schedule wherein one monitor sequentially displays the output images of several cameras, displaying the images of each camera for perhaps a few seconds. In another prior art surveillance system (referred to as the “QUAD” system), four cameras are connected to a single monitor whose screen continuously and simultaneously displays the four different images. In a “quaded quad” prior art surveillance system, sixteen cameras are linked to a single monitor whose screen now displays, continuously and simultaneously, all sixteen different images. These improvements allow fewer personnel to adequately supervise the monitors to cover the larger protected area.

These improvements, however, still require the constant attention of at least one person. The above described multiple-image/single screen systems suffered from poor resolution and complex viewing. The reliability of the entire system is still dependent on the alertness and efficiency of the security personnel watching the monitors. The personnel watching the monitors are still burdened with identifying an abnormal act or condition shown on one of the monitors, determining which camera, and which corresponding zone of the protected area, is recording the abnormal event, determining the level of concern placed on the particular event, and finally, determining the appropriate actions that must be taken to respond to the particular event.

Eventually, it was recognized that human personnel could not reliably monitor the “real-time” images from one or several cameras for long “watch” periods of time. It is natural for any person to become bored while performing a monotonous task, such as staring at one or several monitors continuously, waiting for something unusual or abnormal to occur, something which may never occur.

As discussed above, it is the human link which lowers the overall reliability of the entire surveillance system. U.S. Pat. No. 4,737,847 issued to Araki et al. discloses an improved abnormality surveillance system wherein motion sensors are positioned within a protected area to first determine the presence of an object of interest, such as an intruder. In the system disclosed by U.S. Pat. No. 4,737,847, zones having prescribed “warning levels” are defined within the protected area. The zone in which an object or person is detected, the zones to which it moves, and the length of time the detected object or person remains in a particular zone together determine whether the object or person entering the zone should be considered an abnormal event or a threat.

The surveillance system disclosed in U.S. Pat. No. 4,737,847 does remove some of the monitoring responsibility otherwise placed on human personnel; however, such a system can only determine an intruder's “intent” by his presence relative to particular zones. The actual movements and sounds of the intruder are not measured or observed. A skilled criminal could easily determine the warning levels of obvious zones within a protected area and act accordingly, spending little time in zones having a high warning level, for example.

It is therefore an object of the present invention to provide a surveillance system which overcomes the problems of the prior art.

It is another object of the invention to provide such a surveillance system wherein a potentially abnormal event is determined by a computer prior to summoning a human supervisor.

It is another object of the invention to provide a surveillance system which compares specific measured movements of a particular person or persons with a trainable, predetermined set of “typical” movements to determine the level and type of a criminal or mischievous event.

It is another object of this invention to provide a surveillance system which transmits the data from various sensors to a location where it can be recorded for evidentiary purposes. It is another object of this invention to provide such a surveillance system which is operational day and night.

It is another object of this invention to provide a surveillance system which can cull out real-time events which indicate criminal intent using a weapon, by resolving the low temperature of the weapon relative to the higher body temperature and by recognizing the stances taken by the person with the weapon.

It is yet another object of this invention to provide a surveillance system which eliminates or reduces the number of TV monitors and guards presently required to identify abnormal events, as this system will perform this function in near real time.

INCORPORATED BY REFERENCE

The content of the following references is hereby incorporated by reference.

1. Motz, L. and L. Bergstein, “Zoom Lens Systems”, Journal of Optical Society of America, 3 papers in Vol. 52, 1992.

2. D. G. Aviv, “Sensor Software Assessment of Advanced Earth Resources Satellite Systems”, ARC Inc. Report #70-80-A, pp. 2-107 through 2-119; NASA contract NAS-1-16366.

3. Shio, A. and J. Sklansky, “Segmentation of People in Motion”, Proc. of IEEE Workshop on Visual Motion, Princeton, N.J., October 1991.

4. Agarwal, R. and J. Sklansky, “Estimating Optical Flow from Clustered Trajectory Velocity Time”.

5. Suzuki, S. and J. Sklansky, “Extracting Non-Rigid Moving Objects by Temporal Edges”, IEEE, 1992, Transactions of Pattern Recognition.

6. Rabiner, L. and Biing-Hwang Juang, “Fundamentals of Speech Recognition”, Pub. Prentice Hall, 1993 (pp. 434-495).

7. Waibel, A. and Kai-Fu Lee, Eds., “Readings in Speech Recognition”, Pub. Morgan Kaufmann, 1990 (pp. 267-296).

8. Rabiner, L., “Application of Voice Processing to Telecommunication”, Proc. IEEE, Vol. 82, No. 2, February, 1994.

SUMMARY OF THE INVENTION

A preferred embodiment of the herein disclosed invention involves a surveillance system having at least one primary video camera for translating real images of a zone into electronic video signals at a first level of resolution and means for sampling movements within the zone from the video camera output. These elements are combined with means for electronically comparing the sampled movements with known characteristics of movements which are indicative of individuals engaged in criminal activity and means for determining the level of such criminal activity. Associated therewith are means for activating at least one secondary sensor and associated recording device having a second higher level of resolution, said activating means being in response to determining a predetermined level of criminal activity.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a schematic block diagram of the video, analysis, control, alarm and recording subsystems of an embodiment of this invention;

FIG. 2A illustrates a frame K of a video camera's output of a particular environment, according to the invention, showing four representative objects (people) A, B, C, and D, wherein objects A, B and D are moving in a direction indicated with arrows, and object C is not moving;

FIG. 2B illustrates a frame K+5 of the video camera's output, according to the invention, showing objects A, B, and D are stationary, and object C is moving;

FIG. 2C illustrates a frame K+10 of the video camera's output, according to the invention, showing the current location of objects A, B, C, D, and E;

FIG. 2D illustrates a frame K+11 of the video camera's output, according to the invention, showing object B next to object C, and object E moving to the right;

FIG. 2E illustrates a frame K+12 of the video camera's output, according to the invention, showing a potential crime taking place between objects B and C;

FIG. 2F illustrates a frame K+13 of the video camera's output, according to the invention, showing objects B and C interacting;

FIG. 2G illustrates a frame K+15 of the video camera's output, according to the invention, showing object C moving to the right and object B following;

FIG. 2H illustrates a frame K+16 of the video camera's output, according to the invention, showing object C moving away from a stationary object B;

FIG. 2I illustrates a frame K+17 of the video camera's output, according to the invention, showing object B moving towards object C;

FIG. 3A illustrates a frame of a video camera's output, according to the invention, showing a “two on one” interaction of objects (people) A, B, and C;

FIG. 3B illustrates a later frame of the video camera's output of FIG. 3A, according to the invention, showing objects A and C moving towards object B;

FIG. 3C illustrates a later frame of the video camera's output of FIG. 3B, according to the invention, showing objects A and C moving in close proximity to object B;

FIG. 3D illustrates a later frame of the video camera's output of FIG. 3C, according to the invention, showing objects A and C quickly moving away from object B;

FIG. 4 is a schematic block diagram of a conventional word recognition system which may be employed in the invention.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

Referring to FIG. 1, the picture input means 10 may be any conventional electronic picture pickup device operational within the infrared or visual spectrum (or both) including a vidicon and a CCD/TV camera of moderate resolution, e.g., a camera about 1½ inches in length and about 1 inch in diameter, weighing about 3 ounces, including for particular deployment a zoom lens attachment. This device is intended to operate continuously and translate the field of view (“real”) images within a first observation area into conventional video electronic signals.

Alternatively, a high rate camera/recorder, up to 300 frames/sec (similar to those made by NAC Visual Systems of Woodland Hills, Calif., SONY and others) may be used as the picture input means 10. This would enable the detection of even the very rapid movement of body parts that are indicative of criminal intent, and their recording, as hereinbelow described. The more commonly used camera operates at 30 frames per second and cannot capture such quick body movement with sufficient resolution.

Picture input means 10, instead of operating continuously, may be activated by an “alert” signal from the processor of the low resolution camera or from the audio/word recognition processor when sensing a suspicious event.

Picture input means 10 contains a preprocessor which normalizes a wide range of illumination levels, especially for outside observation. The preprocessor emulates a vertebrate's retina, which has an efficient and accurate normalization process. One such preprocessor (VLSI retina chip) is fabricated by the Carver Mead Laboratory of the California Institute of Technology in Pasadena, Calif. Use of this particular preprocessor chip will increase the automated vision capability of this invention whenever variation of light intensity and light reflection may otherwise weaken the picture resolution.

The signals from the picture input means 10 are converted into digitized signals and then sent to the picture processing means 12. The processor means controlling each group of cameras will be governed by an artificial intelligence system, based on dynamic pattern recognition principles, as further described below. Picture processing means 12 includes an image raster analyzer which effectively segments each image to isolate each pair of people. The image raster analyzer subsystem of picture processing means 12 segments each sampled image to identify and isolate each pair of objects (or people), and each “two on one” group of three people separately.

The “two on one” grouping represents a common mugging situation in which two individuals approach a victim, one from in front of the victim and the other from behind. The forward mugger tells the potential victim that if he does not give up his money (or watch, ring, etc.), the second mugger will shoot, stab or otherwise harm him. The group of three people will thus be considered a potential crime in progress and will therefore be segmented and analyzed in picture processing means 12.

With respect to a zoom lens system useful as an element in the picture input means 10, the essentials of the zoom lens subsystem are described in three papers written by L. Motz and L. Bergstein, in an article titled “Zoom Lens Systems” in the Journal of Optical Society of America, Vol. 52, April, 1992. This article is hereby incorporated by reference.

The essence of the zoom system is to vary the focal length such that an object being observed will be focused and magnified at its image plane. In an automatic version of the zoom system, once an object is in the camera's field-of-view (FOV), the lens moves to focus the object onto the camera's image plane. An error signal, which is used to correct the focus at the image plane, is generated by dividing a CCD array into two halves and measuring the difference between them, segmenting each in turn until the object is at the center. Dividing the CCD array into more than two segments, say four quadrants, is a way to achieve automatic centering, as is the case with mono-pulse radar. Regardless of the number of segments, the error signal is used to generate the desired tracking of the object.
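The four-quadrant centering idea described above can be illustrated with a minimal sketch. It assumes the CCD output is available as a rectangular grid of non-negative intensity values with even dimensions; the function name `centering_error` and the normalization by total intensity are illustrative assumptions, not details taken from the disclosure.

```python
def centering_error(image):
    """Return (dx, dy) error signals for a 2-D grid of pixel intensities.

    The grid is split into left/right and top/bottom halves (together
    these form four quadrants). Each error is the normalized intensity
    difference between opposite halves, so it is zero when the bright
    object is centered and approaches +/-1 as the object moves to an edge.
    """
    rows, cols = len(image), len(image[0])
    half_r, half_c = rows // 2, cols // 2
    left = right = top = bottom = 0.0
    for r in range(rows):
        for c in range(cols):
            value = image[r][c]
            if c < half_c:
                left += value
            else:
                right += value
            if r < half_r:
                top += value
            else:
                bottom += value
    total = left + right
    if total == 0:
        return (0.0, 0.0)  # nothing in view, so no correction is requested
    return ((right - left) / total, (bottom - top) / total)
```

A positive `dx` suggests the object lies to the right of center and a positive `dy` suggests it lies below center; a tracking mount would be driven so as to bring both values toward zero.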

In a wide field-of-view (WFOV) operation, there may be more than one object, thus special attention is given to the design of the zoom system and its associated software and firmware control. Assuming three objects, as in the “two on one” potential mugging threat described above, and that the three persons are all in one plane, one can program a shifting from one object to the next, from one face to another face, in a prescribed sequential order. Moreover, as the objects move within the WFOV they will be automatically tracked in azimuth and elevation. In principle, the zoom would focus on the nearest object, assuming that the amount of light on each object is the same, so that the prescribed sequence starting from the closest object will proceed to the remaining objects from, for example, right to left.

However, when the three objects are located in different planes, but still within the camera's WFOV, the zoom, with input from the segmentation subsystem of the picture analysis means 12, will focus on the object closest to the right hand side of the image plane, and then proceed to move the focus to the left, focusing on the next object and on the next sequentially.

In all of the above cases, the automatic zoom can more naturally choose to home in on the person with the brightest emission or reflection, and then proceed to the next brightest and so forth. This would be a form of an intensity/time selection multiplex zoom system.

The relative positioning of the input camera with respect to the area under surveillance will affect the accuracy with which the image raster analyzer segments each image. In this preferred embodiment, it is beneficial for the input camera to view the area under surveillance from a point located directly above, e.g., with the input camera mounted high on a wall, a utility tower, or a traffic light support tower. The height of the input camera is preferably sufficient to minimize occlusion between the input camera and the movement of the individuals under surveillance.

Once the objects within each sampled video frame are segmented (i.e., detected and isolated), an analysis is made of the detailed movements of each object located within each particular segment of each image, and their relative movements with respect to the other objects.

Each image frame segment, once digitized, is stored in a frame by frame memory storage of picture processing means 12. Each frame from the picture input means 10 is subtracted from a previous frame already stored in processing means 12 using any conventional differencing process. The differencing process, involving multiple differencing steps, takes place in the processing section 12. The resulting difference signal (outputted from the differencing sub-section of means 12) of each image indicates all the changes that have occurred from one frame to the next. These changes include any movements of the individuals located within the segment and any movements of their limbs, e.g., arms.
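As an illustration of a conventional differencing process of the kind referred to above, the following minimal sketch subtracts two grayscale frames pixel by pixel and marks the pixels that changed. The function names and the `threshold` parameter (used here to suppress small sensor noise) are assumptions made for the example.

```python
def difference_frame(previous, current, threshold=10):
    """Return a binary mask: 1 where a pixel changed by more than threshold."""
    return [
        [1 if abs(c - p) > threshold else 0 for p, c in zip(prev_row, cur_row)]
        for prev_row, cur_row in zip(previous, current)
    ]


def changed_pixels(mask):
    """List the (row, column) coordinates of every changed pixel in a mask."""
    return [
        (r, c)
        for r, row in enumerate(mask)
        for c, value in enumerate(row)
        if value
    ]
```

When a bright object moves one pixel to the right between two frames, the mask marks both the position it left and the position it entered, which is the raw material for the tracks discussed next.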

Referring to FIG. 3, a collection of differencing signals for each moved object of subsequent sampled frames of images (called a “track”) allows a determination of the type, speed and direction (vector) of each motion involved. Further processing will extract acceleration, i.e., rate of change of velocity, and change in acceleration with respect to time (called “jerkiness”), and correlate these with stored signatures of known physical criminal acts. For example, subsequent differencing signals may reveal that an individual's arm is moving to a high position (such as the upper limit of that arm's motion, i.e., above his head) at a fast speed. This particular movement could be perceived, as described below, as a hostile movement with a possible criminal activity requiring the expert analysis of security personnel.
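The quantities named above (velocity, acceleration and “jerkiness”) can be estimated from a track by repeated finite differences. The sketch below assumes a track is a list of (x, y) positions sampled at a fixed interval `dt`; the function names are hypothetical and the first-order difference scheme is one simple choice among many.

```python
def finite_difference(samples, dt):
    """Difference consecutive vector samples and divide by the time step."""
    return [
        tuple((b - a) / dt for a, b in zip(p, q))
        for p, q in zip(samples, samples[1:])
    ]


def motion_profile(track, dt):
    """Return (velocity, acceleration, jerk) lists for a track of positions.

    Each differencing step shortens the list by one sample, so a track of
    n positions yields n-1 velocities, n-2 accelerations and n-3 jerks.
    """
    velocity = finite_difference(track, dt)
    acceleration = finite_difference(velocity, dt)
    jerk = finite_difference(acceleration, dt)
    return velocity, acceleration, jerk
```

For a body part whose x position follows 0, 1, 4, 9, 16 at unit time steps, the sketch yields a steadily increasing velocity, a constant acceleration of 2 and zero jerk; an abrupt blow would instead show large jerk values.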

The intersection of two tracks indicates the intersection of two moved objects. The intersecting objects, in this case, could be merely the two hands of two people greeting each other, or depending on other characteristics, as described below, the intersecting objects could be interpreted as a fist of an assailant contacting the face of a victim in a less friendly greeting. In any event, the intersection of two tracks immediately requires further analysis and/or the summoning of security personnel. The activation of alarm, light and sound devices located, for example, on a monitor will turn a guard's attention only to that monitor, hence the labor savings. In general, however, friendly interactions between individuals are a much slower physical process than a physical assault, vis-a-vis body parts of the individuals involved. Hence, friendly interactions may be easily distinguished from hostile physical acts using current low pass and high pass filters, and current pattern recognition techniques based on experimental reference data.
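The distinction drawn above between slow, friendly contact and fast, hostile contact can be sketched as follows. This is only an illustration: it substitutes a simple closing-speed threshold for the filters and pattern recognition techniques mentioned in the text, and the names `contact_distance` and `hostile_speed`, as well as any values given to them, are assumptions that would have to come from experimental reference data.

```python
import math


def first_contact(track_a, track_b, contact_distance):
    """Return the first frame index at which two tracks intersect, or None."""
    for k, (pa, pb) in enumerate(zip(track_a, track_b)):
        if math.dist(pa, pb) <= contact_distance:
            return k
    return None


def closing_speed(track_a, track_b, k, dt):
    """Rate at which the two objects approached each other just before frame k."""
    if k == 0:
        return 0.0  # no earlier frame to compare against
    before = math.dist(track_a[k - 1], track_b[k - 1])
    at_contact = math.dist(track_a[k], track_b[k])
    return (before - at_contact) / dt


def classify_contact(track_a, track_b, dt, contact_distance, hostile_speed):
    """Label an interaction as 'no contact', 'friendly' or 'hostile'."""
    k = first_contact(track_a, track_b, contact_distance)
    if k is None:
        return "no contact"
    speed = closing_speed(track_a, track_b, k, dt)
    return "hostile" if speed >= hostile_speed else "friendly"
```

In practice either label would only prioritize the event for further analysis, since the text requires further analysis whenever two tracks intersect.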

When a large number of sensors (called a sensor suite) are distributed over a large number of facilities, for example, a number of ATMs (automatic teller machines) associated with particular bank branches in a particular state or states and all operated under a single bank network control, then only one monitor is required.

A commercially available software tool may enhance object-movement analysis between frames (called optical flow computation). With optical flow computation, specific (usually bright) reflective elements, called farkles, emitted from the clothing and/or the body parts of an individual of one frame are subtracted from a previous frame. The bright portions will inherently provide sharper detail and therefore will yield more accurate data regarding the velocities of the relative moving objects. Additional computation, as described below, will provide data regarding the acceleration and even change in acceleration or “jerkiness” of each moving part sampled.

The physical motions of the individuals involved in an interaction will be detected by first determining the edges of each person imaged. The movements of the body parts will then be observed by noting the movements of the edges of the body parts of the individuals involved in the interaction. The differencing process will enable the determination of the velocity, acceleration and rate of change of acceleration of those body parts.

The now processed signal is sent to comparison means 14 which compares selected frames of the video signals from the picture input means 10 with “signature” video signals stored in memory 16. The signature signals are representative of various positions and movements of the body parts of an individual having various levels of criminal intent. The method for obtaining the data base of these signature video signals in accordance with another aspect of the invention is described in greater detail below.

If a comparison proves positive with one or more of the signature video signals, an output “alert” signal is sent from the comparison means 14 to a controller 18. The controller 18 controls the operation of a secondary, high resolution picture input means (video camera) 20 and a conventional monitor 22 and video recorder 24. The field of view of the secondary camera 20 is preferably, at most, the same as the field of view of the primary camera 10, surveying a second observation area. The recorder 24 may be located at the site and/or at both a law enforcement facility (not shown) and simultaneously at a court office or legal facility to prevent loss of incriminating information due to tampering.

The purpose of the secondary camera 20 is to provide a detailed video signal of the individual having assumed criminal intent and also to improve false positive and false negative performance. This information is recorded by the video recorder 24 and displayed on a monitor 22. An alarm bell or light (not shown) or both may be provided and activated by an output signal from the controller 18 to summon a supervisor to immediately view the pertinent video images showing the apparent crime in progress and assess its accuracy.

In still another embodiment of the invention, a VCR 26 is operating continuously (using a 6 hour loop-tape, for example). The VCR 26 is controlled by the VCR controller 28. All the “real-time” images directly from the picture input means 10 are immediately recorded and stored for at least 6 hours, for example. Should it be determined that a crime is in progress, a signal from the controller 18 is sent to the VCR controller 28 changing the mode of recording from tape looping mode to non-looping mode. Once the VCR 26 is changed to a non-looping mode, the tape will not re-loop and will therefore retain the perhaps vital recorded video information of the surveyed site, including the crime itself, and the events leading up to the crime.
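The looping and non-looping recording modes described above behave much like a bounded buffer that stops discarding old frames once an alarm arrives. The following sketch models that behavior in software; the class name `LoopRecorder` and its methods are hypothetical, and, unlike a physical tape, the sketch assumes unlimited storage once looping has stopped.

```python
from collections import deque


class LoopRecorder:
    """Keeps the most recent frames until told to stop looping."""

    def __init__(self, loop_capacity):
        # While looping, the deque silently drops the oldest frame when full.
        self._loop = deque(maxlen=loop_capacity)
        self._retained = None  # becomes a plain list in non-looping mode

    def record(self, frame):
        if self._retained is None:
            self._loop.append(frame)
        else:
            self._retained.append(frame)  # nothing is overwritten any more

    def stop_looping(self):
        """Switch to non-looping mode, preserving the events leading up to now."""
        if self._retained is None:
            self._retained = list(self._loop)

    def frames(self):
        """Return the frames currently held, oldest first."""
        if self._retained is None:
            return list(self._loop)
        return list(self._retained)
```

Calling `stop_looping()` plays the role of the signal sent from the controller 18 to the VCR controller 28: everything still in the loop at that moment, plus everything recorded afterwards, is kept.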

When the non-looping mode is initiated, the video signal may also be transmitted to a VCR located elsewhere; for example, at a law enforcement facility and, simultaneously, to other secure locations of the Court and its associated offices.

Prior to the video signals being compared with the “signature” signals stored in memory, each sampled frame of video is “segmented” into parts relating to the objects detected therein. To segment a video signal, the video signal derived from the vidicon or CCD/TV camera is analyzed by an image raster analyzer. Although this process causes slight signal delays, it is accomplished nearly in real time.

At certain sites, or in certain situations, a high resolution camera may not be required or otherwise used. For example, the resolution provided by a relatively simple and low cost camera may be sufficient. Depending on the level of security for the particular location being surveyed, and the time of day, the length of frame intervals between analyzed frames may vary. For example, in a high risk area, every frame from the CCD/TV camera may be analyzed continuously to ensure that the maximum amount of information is recorded prior to and during a crime. In a low risk area, it may be preferred to sample perhaps every 10 frames from each camera, sequentially.

If, during such a sampling, it is determined that an abnormal or suspicious event is occurring, such as two people moving very close to each other, then the system would activate an alert mode wherein the system becomes “concerned and curious” about the suspicious actions and the sampling rate is increased to perhaps every 5 frames or even every frame. As described in greater detail below, depending on the type of system employed (i.e., video only, audio only or both), during such an alert mode, the entire system may be activated wherein both audio and video systems begin to sample the environment for sufficient information to determine the intent of the actions.
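The change of sampling rate on entering the alert mode can be sketched as a simple scheduling loop. The intervals of 10 and 5 frames follow the examples given in the text; the function name and the representation of suspicious events as a set of frame indices are assumptions made for the illustration.

```python
def frames_to_analyze(total_frames, suspicious_at, normal_interval=10,
                      alert_interval=5):
    """Return the indices of the frames that would be analyzed.

    Sampling starts at the normal interval. As soon as an analyzed frame
    is found in suspicious_at, the system enters alert mode and uses the
    shorter alert interval from then on.
    """
    analyzed = []
    interval = normal_interval
    k = 0
    while k < total_frames:
        analyzed.append(k)
        if k in suspicious_at:
            interval = alert_interval  # alert mode: sample more often
        k += interval
    return analyzed
```

Note that in this sketch a suspicious event is noticed only if it falls on an analyzed frame, which is the trade-off accepted in a low risk area; setting both intervals to 1 corresponds to the high risk case in which every frame is analyzed.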

Referring to FIG. 2, several frames of a particular camera output are shown to illustrate the segmentation process performed in accordance with the invention. The system begins to sample at frame K and determines that there are four objects (previously determined to be people, as described below), A-D, located within a particular zone being policed. Since nothing unusual is determined from the initial analysis, the system does not warrant an “alert” status. People A, B, and D are moving according to normal, non-criminal intent, as could be observed.

A crime likelihood is indicated when frames K+10 through K+13 are analyzed by the differencing process. If the movements of the body parts indicate velocity, acceleration and “jerkiness” that compare positively with the stored digital signals depicting movements of known criminal physical assaults, it is likely that a crime is in progress here.

Additionally, if a high velocity of departure is indicated when person C moves away from person B, as indicated in frames K+15 through K+17, a larger level of confidence is attained in deciding that a physical criminal act has taken place or is about to.

An alarm is generated the instant any of the above conditions is established. This alarm condition will result in Police or Guards being sent to the crime site, the high resolution CCD/TV camera being activated to record the face of the person committing the assault, and a loud speaker being activated automatically, playing a recorded announcement warning the perpetrator of the seriousness of his actions now being undertaken and demanding that he cease the criminal act. After dark a strong light will be turned on automatically. The automated responses will be actuated the instant an alarm condition is determined by the processor. Furthermore, an alarm signal is sent to the police station, and the same video signal of the event is transmitted to a court appointed data collection office, to the Public Defender's office and the District Attorney's Office.

As described above, it is necessary to compare the resulting signature of the body part motions involved in a physical criminal act, expressed by specific motion characteristics (i.e., velocity, acceleration, change of acceleration), with a set of signature files of physical criminal acts in which body part motions are equally involved. This comparison is commonly referred to as pattern matching and is part of the pattern recognition process.
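One elementary form of pattern matching is a nearest-neighbor comparison between an observed feature vector and a set of stored signatures. The sketch below is offered only as an illustration of that idea, not as the matching method of the invention; it assumes each signature is reduced to a tuple of (peak velocity, peak acceleration, peak jerk), and the signature names, numeric values and `max_distance` threshold are all hypothetical.

```python
import math


def best_match(observed, signatures, max_distance):
    """Return (name, distance) of the closest stored signature.

    observed is a tuple of motion characteristics; signatures maps a
    signature name to a reference tuple of the same length. If even the
    closest signature is farther than max_distance, the name is None.
    """
    best_name, best_distance = None, math.inf
    for name, reference in signatures.items():
        distance = math.dist(observed, reference)
        if distance < best_distance:
            best_name, best_distance = name, distance
    if best_distance <= max_distance:
        return best_name, best_distance
    return None, best_distance


# Hypothetical signature file: (peak velocity, peak acceleration, peak jerk).
SIGNATURES = {
    "punch": (8.0, 40.0, 200.0),
    "handshake": (1.0, 2.0, 5.0),
}
```

A practical system would likely scale each characteristic before measuring distance, since jerk values would otherwise dominate the comparison.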

Files of physical criminal acts, which involve movements of body parts such as hands, arms, elbows, shoulders, head, torso, legs, and feet, can be reviewed to ascertain this pattern. In addition, a priority can be set by experiments and simulations of physical criminal acts gathered from “dramas” that are enacted by professional actors. Data gathered from experienced muggers who have been caught by the police, as well as from victims who have reported details of their experiences, will help the actors perform accurately. Video of the motions involved in these simulated acts can be stored in digitized form, and files prepared for the signature motion of each of the body parts involved in the simulated physical criminal acts.

In another embodiment, the above described Abnormality Detection System includes an RF-ID (Radio Frequency Identification) tag or card to assist in the detection and tracking of individuals within the field of view of a camera. Such cards or tags could be used by authorized individuals to respond when queried by the RF interrogator. The response signal of the tag has a propagation pattern which is adequately registered with the video sensor. The card or tag, when sensed in video, would be assumed friendly and authorized. This information would simplify the segmentation process.

A light connected to each RF-ID card will be turned ON when a positive response to an interrogation signal is established. The light will appear on the computer generated grid (also on the screen of the monitor), and the intersection of tracks will be clearly indicated, followed by their physical interaction. Also noted will be the intersection between the tagged and the untagged individuals. In all such cases, the segmentation process will be simpler.

There are many manufacturers of RF-ID cards and interrogators; three major ones are the David Sarnoff Research Center of Princeton, N.J., AMTECH of Dallas, Tex. and MICRON Technology of Boise, Id.

The applications of the present invention include banks, ATMs, hotels, schools, residence halls and dormitories, office and residential buildings, hospitals, sidewalks, street crossings, parks, containers and container loading areas, shipping piers, train stations, truck loading stations, airport passenger and freight facilities, bus stations, subway stations, theaters, concert halls, sport arenas, libraries, churches, museums, stores, shopping malls, restaurants, convenience stores, bars, coffee shops, gasoline stations, highway rest stops, tunnels, bridges, gateways, sections of highways, toll booths, warehouses and depots, factories and assembly rooms, and law enforcement facilities including jails. Any location or facility, civilian or military, requiring security would be a likely application.

Further applications of this invention are in moving platforms: automobiles, trucks, buses, subway cars, train cars (both freight and passenger), boats, ships (passenger and freight), tankers, service and construction vehicles (on and off-road), containers and their carriers, and airplanes, and also in equivalent military and sensitive mobile platforms.

As a deterrent to car-jacking, a tiny CCD/TV camera hidden in the ceiling or the rearview mirror of the car, and focussed through a pin hole lens on the driver's seat, may be connected to the video processor to record the face of the driver. The camera is triggered by the automatic word recognition processor that will identify the well known expressions commonly used by the car-jacker. The video picture will be recorded and then transmitted via cellular phone in the car. Without a phone, the short video recording of the face of the car-jacker will be held until the car is found by the police, but now with the evidence (the picture of the car-jacker) in hand.

In the present surveillance system, the security personnel manning the monitors are alerted only to video images which show suspicious actions (criminal activities) within a prescribed observation zone. The security personnel are therefore used to assess the accuracy of the crime detection and determine the necessary actions for an appropriate response. By using computers to effectively filter out all normal and noncriminal video signals from observation areas, fewer security personnel are required to survey and “secure” a greater overall area (including a greater number of observation areas, i.e., cameras).

It is also contemplated that the present system could be applied to help blind people “see”. A battery-operated portable version of the video system would automatically identify known objects in its field of view, and a speech synthesizer would “say” the name of each object. For example, “chair”, “table”, etc. would indicate the presence of a chair and a table.

Depending on the area to be policed, it is preferable that at least two and perhaps three cameras (or video sensors) are used simultaneously to cover the area. Should one camera sense a first level of criminal action, the other two could be manipulated to provide a three-dimensional perspective coverage of the action. The three-dimensional image of a physical interaction in the policed area would allow observation of a greater number of details associated with the steps: accost, threat, assault, response and post response. The conversion process from the two-dimensional image to the three-dimensional image is achieved by use of the known Radon transform.

In the extended operation phase of the invention, the accumulation of more details of the physical variation of movement characteristics of physical threats and assaults against a victim, together with speaker-independent (male and female, of different age groups) and dialect-independent words and terse sentences, with corresponding responses, will enable automatic recognition of a criminal assault without the need of a guard, unless one is required by statutes or other external requirements.

In another embodiment of the present invention, both video and acoustic information are sampled and analyzed. The acoustic information is sampled and analyzed in a similar manner to the sampling and analyzing of the above-described video information. The audio information is sampled and analyzed in the manner shown in FIG. 4, which is based on prior art.

The employment of the audio speech band, with its associated Automatic Speech Recognition (ASR) system, will not only reduce the false alarm rate resulting from the video analysis, but can also be used to trigger the video and other sensors if the sound threat predates the observed threat.

Referring to FIG. 4, a conventional automatic word recognition system is shown, including an input microphone system 40, an analysis subsystem 42, a template subsystem 44, a pattern comparator 46, and a post-processor and decision logic subsystem 48.

In operation, upon activation, the acoustic/audio policing system will begin sampling all (or a selected portion) of nearby acoustic signals. The acoustic signals will include voices and background noise. The background noise signals are generally known and predictable, and may therefore be easily filtered out using conventional filtering techniques. Among the expected noise signals are unfamiliar speech, automotive related sounds, honking, sirens, and the sound of wind and/or rain.

The microphone input system 40 picks up the acoustic signals, immediately filters out the predictable background noise signals, and amplifies the remaining recognizable acoustic signals. The filtered acoustic signals are analyzed in the analysis subsystem 42, which processes the signals by means of digital and spectral analysis techniques. The output of the analysis subsystem is compared in the pattern comparator subsystem 46 with selected predetermined words stored in memory in the template subsystem 44. The post-processing and decision logic subsystem 48 generates an alarm signal, as described below.

The templates 44 include perhaps about 100 brief and easily recognizable terse expressions, some of which are single words, that are commonly used by those intent on a criminal act. Some examples of commonly used word phrases spoken by a criminal to a victim prior to a mugging, for example, include: “Give me your money”, “This is a stick-up”, “Give me your wallet and you won't get hurt” . . . etc. Furthermore, commonly used replies from a typical victim during such a mugging may also be stored as template words, such as “help”, and certain sounds such as shrieks, screams and groans, etc.

The specific word templates, against which input acoustic sounds are compared, must be chosen carefully, taking into account the particular accents and slang of the language spoken in the region of concern. Hence, a statistical averaging of the spectral content of each word must be used.
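
The averaging and comparison steps described above can be illustrated with a minimal sketch. It is an assumption-laden illustration, not the system of FIG. 4: each utterance is assumed to have already been reduced to a fixed-length vector of spectral band energies, the comparison uses a plain Euclidean distance, and the names `average_template`, `match_word`, `TEMPLATES` and all numeric values are hypothetical.

```python
import math

def average_template(examples):
    """Average equal-length spectral feature vectors from several speakers."""
    count = len(examples)
    return [sum(column) / count for column in zip(*examples)]

def euclidean_distance(a, b):
    """Straight-line distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def match_word(features, templates, max_distance):
    """Return the closest template word, or None if nothing is close enough."""
    best_word, best_distance = None, float("inf")
    for word, template in templates.items():
        d = euclidean_distance(features, template)
        if d < best_distance:
            best_word, best_distance = word, d
    return best_word if best_distance <= max_distance else None

# Hypothetical 3-band spectral energies, two speakers per word.
TEMPLATES = {
    "help": average_template([[1.0, 0.0, 0.2], [0.8, 0.2, 0.0]]),
    "money": average_template([[0.1, 0.9, 0.7], [0.3, 0.7, 0.9]]),
}
```

Calling `match_word([0.85, 0.15, 0.1], TEMPLATES, 0.3)` selects the "help" template, while an input far from every template yields `None`, so no trigger is produced. A practical recognizer would also need to align utterances of different durations, which this sketch leaves out.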

The output of the word recognition system shown in FIG. 4 is used as a trigger signal to activate a sound recorder, or a camera used elsewhere in the invention, as described below.

The preferred microphone used in the microphone input subsystem 40 is a shot-gun microphone, such as those commercially available from the Sennheiser Company of Frankfurt, Germany. These microphones have a super-cardioid propagation pattern. However, the gain of the pattern may be too small for high traffic areas and may therefore require more than one microphone in an array configuration to adequately focus and track in these areas. The propagation pattern of the microphone system enables better focusing on a moving sound source (e.g., a person walking and talking). A conventional directional microphone, such as those made by the Sony Corporation of Tokyo, Japan, may also be used in place of a shot-gun type microphone. Such directional microphones will achieve similar gain to the shot-gun type microphones, but with a smaller physical structure.

A feedback loop circuit (not specifically shown) originating in the post-processing subsystem 48 will direct the microphone system to track a particular dynamic source of sound within the area surveyed by video cameras.

An override signal from the video portion of the present invention will activate and direct the microphone system towards the direction of the field of view of the camera. In other words, should the video system detect a potential crime in progress, the video system will direct the audio recording system towards the scene of interest. Likewise, should the audio system detect words of an aggressive nature, as described above, the audio system will direct appropriate video cameras to visually cover and record the apparent source of the sound.

A number of companies have developed very accurate and efficient speaker-independent word recognition systems based on a hidden Markov model (HMM) in combination with an artificial neural network (ANN). These companies include IBM of Armonk, N.Y., AT&T Bell Laboratories, Kurzweil of Cambridge, Mass., and Lernout and Hauspie of Belgium.

Put briefly, the HMM applies a probabilistic statistical procedure in recognizing words. In the training steps, an estimate is made of the means and covariance of the probabilistic model of each word, e.g., those words which are considered likely to be uttered in an interaction. The various ways in which any given word is pronounced permit the spectral parameters of the word to serve as an effective descriptor of the model. The steps involved in recognizing an unknown input word consist of computing the likelihood that the word was generated by each of the models developed during the training. The word is considered “recognized” as the word whose model gives the highest score. Finally, since the words are composed of word units, the evaluation of the conditional probabilities of one particular unit being followed by the same or another word unit is also part of the computation.
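
The recognition step described above (score the input against every trained word model and take the highest) can be sketched as follows. This is a simplified illustration under stated assumptions: the models are discrete-symbol HMMs rather than models with Gaussian means and covariances, the input is assumed to be already quantized into symbol indices, and the model parameters and the names `forward_log_likelihood`, `recognize` and `WORD_MODELS` are hypothetical.

```python
import math

def forward_log_likelihood(obs, start, trans, emit):
    """Log P(obs | model) for a discrete HMM, using the scaled forward algorithm."""
    n = len(start)
    alpha = [start[s] * emit[s][obs[0]] for s in range(n)]
    log_lik = 0.0
    for symbol in obs[1:]:
        scale = sum(alpha)
        if scale == 0.0:
            return float("-inf")
        log_lik += math.log(scale)
        alpha = [a / scale for a in alpha]  # rescale to avoid underflow
        alpha = [
            sum(alpha[p] * trans[p][s] for p in range(n)) * emit[s][symbol]
            for s in range(n)
        ]
    final = sum(alpha)
    return log_lik + math.log(final) if final > 0.0 else float("-inf")

# Hypothetical two-state, left-to-right models over three symbols (0, 1, 2).
# Each entry is (start probabilities, transition matrix, emission matrix).
WORD_MODELS = {
    "help": ([1.0, 0.0], [[0.6, 0.4], [0.0, 1.0]],
             [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1]]),
    "money": ([1.0, 0.0], [[0.6, 0.4], [0.0, 1.0]],
              [[0.1, 0.1, 0.8], [0.8, 0.1, 0.1]]),
}

def recognize(obs):
    """Return the word whose model assigns the observation the highest score."""
    return max(WORD_MODELS, key=lambda w: forward_log_likelihood(obs, *WORD_MODELS[w]))
```

The transition matrix in each model plays the role of the conditional probabilities between successive word units mentioned above. With these illustrative parameters, `recognize([0, 0, 1, 1])` selects "help" and `recognize([2, 2, 0, 0])` selects "money".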

The resulting list of potential words is considerably shorter than the entire list of all spoken words of the English language. Therefore, the HMM system employed with the present invention allows both the audio and video systems to operate quickly and use HMM probability statistics to predict future movements or words based on an early recognition of initial movements and word stems.

The HMM system may be equally employed in the video recognition system. For example, if a person's arm quickly moves above his head, the HMM system may determine that there is a high probability that the arm will quickly come down, perhaps indicating a criminal intent.
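
The raised-arm example can be illustrated with a small sketch. It deliberately simplifies the HMM to a first-order Markov chain over movements that have already been labelled, so there are no hidden states. The movement labels, the transition probabilities, the 0.5 threshold and the names `predict_next` and `should_alert` are all hypothetical.

```python
# Hypothetical probabilities of the next movement given the current one.
TRANSITIONS = {
    "arm_at_side": {
        "arm_at_side": 0.85, "arm_raised_slow": 0.10, "arm_raised_fast": 0.05,
    },
    "arm_raised_slow": {
        "arm_at_side": 0.80, "arm_raised_slow": 0.20,
    },
    "arm_raised_fast": {
        "arm_strikes_down": 0.70, "arm_at_side": 0.20, "arm_raised_fast": 0.10,
    },
}

# Movements treated as indicative of an assault (an assumption of this sketch).
FLAGGED = {"arm_strikes_down"}

def predict_next(current):
    """Return the most probable next movement and its probability."""
    row = TRANSITIONS[current]
    return max(row.items(), key=lambda item: item[1])

def should_alert(current, threshold=0.5):
    """True if the predicted probability of any flagged movement reaches the threshold."""
    row = TRANSITIONS[current]
    return sum(row.get(movement, 0.0) for movement in FLAGGED) >= threshold
```

Under these assumed numbers, observing `"arm_raised_fast"` predicts `"arm_strikes_down"` with probability 0.70 and raises an alert before the strike is observed, whereas `"arm_at_side"` does not.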

While certain embodiments of the invention have been described for illustrative purposes, it is to be understood that there may be various other modifications and embodiments within the scope of the invention as defined by the following claims.

1. A surveillance system, comprising: a) a video camera for translating real images of an area into electronic video signals; b) means for sampling movements of an individual located within the area from said electronic video signals of said video camera; c) means for electronically comparing said sampled movements with predetermined characteristics of movements; d) means for predicting future movements of said individual based on said electronic comparing means of said sampled movements; and e) means for generating a signal responsive to predetermined predicted future movements.
2. The surveillance system in accordance with claim 1, wherein said signal generating means activates a video signal recorder for recording said video signals from said camera.
3. The surveillance system in accordance with claim 1, wherein said signal generating means activates a microphone for receiving audible information of said individual located in said area.
4. The surveillance system in accordance with claim 1, wherein said signal generating means activates at least one high resolution camera.
5. A surveillance system, comprising: a video camera capable of generating electronic video signals based on real images of an area viewed by the video camera, the electronic video signals comprising a first resolution; a movement sampler capable of sampling movements of at least one individual in the generated electronic video signals; a movement comparer capable of comparing sampled movements of the at least one individual with predetermined movement characteristics; a future movement predictor capable of predicting future movements of the at least one individual based on the compared sampled movements of the at least one individual with the predetermined movement characteristics; and an alert signal generator capable of generating an alert signal responsive to predicted future movements.
6. The surveillance system according to claim 5, wherein the alert signal generator generates an alert signal that is capable of activating a video signal recorder coupled to the electronic video signals.
7. The surveillance system according to claim 5, further comprising a video signal recorder capable of recording the electronic video signals.
8. The surveillance system according to claim 5, wherein the alert signal generator generates an alert signal that is capable of activating a microphone that is capable of receiving audible information corresponding to the area viewed by the video camera.
9. The surveillance system according to claim 8, further comprising an audio processor capable of filtering out predictable background audible information and capable of amplifying remaining audible information.
10. The surveillance system according to claim 8, further comprising an audio processor capable of recognizing speech in the received audible information.
11. The surveillance system according to claim 5, wherein the alert signal generator generates an alert signal that is capable of activating a second video camera capable of generating electronic video signals based on real images of the area viewed by the second video camera, the electronic video signals generated by the second video camera comprising a second resolution that is greater than the first resolution.
12. The surveillance system according to claim 11, wherein the future movement predictor predicts future movements of the at least one individual further based on at least one characteristic of a track associated with the at least one individual.
13. The surveillance system according to claim 12, wherein the at least one characteristic of the track comprise a movement type, a movement speed, a movement direction, a movement acceleration, a movement change in speed, a movement change in direction, or a movement change in acceleration, or combinations thereof.
14. The surveillance system according to claim 5, wherein the future movement predictor predicts future movements of the at least one individual further based on at least one characteristic of a track associated with the at least one individual.
15. The surveillance system according to claim 14, wherein the at least one characteristic of the track comprise a movement type, a movement speed, a movement direction, a movement acceleration, a movement change in speed, a movement change in direction, or a movement change in acceleration, or combinations thereof.
16. The surveillance system according to claim 5, wherein the future movement predictor is further capable of analyzing detailed movements of the at least one individual with respect to at least one other individual or at least one object, or combinations thereof.
17. The surveillance system according to claim 5, wherein the video camera is further capable of varying a focal length of the video camera in response to a video signal of the at least one individual.
18. The surveillance system according to claim 5, wherein the movement comparer is capable of pattern matching the at least one sampled movement to a known movement pattern that is indicative of an individual having criminal intent.
19. The surveillance system according to claim 18, wherein the future movement predictor predicts future movements of the at least one individual further based on at least one sampled movement matching the known movement pattern that is indicative of an individual having criminal intent.
20. The surveillance system according to claim 19, wherein the at least one sampled movement that matches the known movement pattern that is indicative of an individual having criminal intent comprises a characteristic comprising a movement type, a movement speed, a movement direction, a movement acceleration, a movement change in speed, a movement change in direction, or a movement change in acceleration, or combinations thereof.